12 research outputs found

    Genomic Reconstruction of the Tree of Life

    A new methodology is presented for molecular phylogenetic analysis that addresses a fundamental problem in biology, namely the reconstruction of the Tree of Life (TOL). Here, phylogenies are based on patterns of hybridization similarity in the organisms' DNA. Furthermore, they are based on a set of universal biomarkers (so-called nxh chips) chosen a priori, independently of the target group of organisms. This methodology therefore enables analyses of groups of biologically distant organisms and could hence be scaled to obtain a universal tree of life. Unlike conventional molecular methods, it produces a hypothesis in a single run, without optimizing across numerous hypotheses for consensus. Prototype hypotheses agree with the biological ground truth in over 70% of the relationships. Higher-quality nxh chips are likely to produce better hypotheses but are more difficult to design.

    BIOMOLECULE INSPIRED DATA SCIENCE


    Towards reliable microarray analysis and design

    No full text
    Microarray studies represent an extraordinary advance but are notoriously noisy, produce results that are hardly reproducible, and are therefore unreliable at best. We present a framework that affords a quantitative assessment of their inherent noise and (in)accuracy and, more importantly, a methodology to remove noise and improve their reliability and reproducibility. Furthermore, we present a principled methodology to design a new-generation microarray variant (nxh chips) that is provably reliable under two new metrics of reliability. The potential impact is illustrated with a new approach to phylogenetics that produces a universal tree of life (TOL) based on more objective criteria and is feasibly scalable to entire genomes, while being implementable on current microarray biotechnology. A preliminary TOL is shown as a phylogenetic analysis of a selection of 22 species/biomarkers from the standard Biomarker Identification Numbers (BOLD/BINs). Conclusions include some discussion of theoretical and experimental problems in implementing this methodology in practice.

    Deep structure of DNA for genomic analysis

    No full text
    Recent advances in next-generation sequencing, deep networks and other bioinformatic tools have enabled us to mine huge amounts of genomic information about living organisms in the post-microarray era. However, these tools do not explicitly factor in the role of the underlying DNA biochemistry (particularly, DNA hybridization) essential to life processes. Here, we focus more precisely on the role that DNA hybridization plays in determining properties of biological organisms at the macro-level. We illustrate its role with solutions to challenging problems in human disease. These solutions are made possible by novel structural properties of DNA hybridization landscapes revealed by a metric model of oligonucleotides of a common length that makes them reminiscent of some planets in our solar system, particularly Earth and Saturn. These properties allow a judicious selection of so-called noncrosshybridizing (nxh) bases that offer a substantial reduction of DNA sequences of arbitrary length into a few informative features. The quality of the information extracted by these bases is high because of their very low Shannon entropy, i.e. they minimize the degree of uncertainty in hybridization that makes results on standard microarrays irreproducible. For example, SNP classification (pathogenic/non-pathogenic) and pathogen identification can be solved with high sensitivity (~77% and 100%, respectively) and specificity (~92% and 100%, respectively) for combined taxa on a sample of over 264 fully coding sequences in whole bacterial genomes and fungal mitochondrial genomes using machine learning (ML) models. These methods can be applied to several other interesting research questions that could be addressed with similar genomic analyses.
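    As a reminder of how the two reported figures are defined, sensitivity is the fraction of true positives that a classifier recovers, and specificity is the fraction of true negatives it recovers. The sketch below computes both from binary labels. The function name and the toy labels are assumptions made up for illustration; they are not the authors' data or code.

```python
def sensitivity_specificity(y_true, y_pred, positive=1):
    """Return (sensitivity, specificity) for binary labels.

    sensitivity = TP / (TP + FN), specificity = TN / (TN + FP).
    """
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative example")
    return tp / (tp + fn), tn / (tn + fp)


# Toy labels (hypothetical): 1 = pathogenic, 0 = non-pathogenic.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
sens, spec = sensitivity_specificity(y_true, y_pred)
print(f"sensitivity={sens:.2f} specificity={spec:.2f}")
```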

    Towards a universal genomic positioning system: Phylogenetics and species identification

    No full text
    Technology to gather biomic data now far exceeds the capabilities of tools to extract useful information and knowledge from it, a challenging predicament given the demands of our time, such as personalized medicine. We propose a new family of data structures that represent omics data in a way that is more anchored in biological reality and that are processed by algorithms more consistent with it, so that DNA itself can be used to process the data, extract useful knowledge, and organize and store it as needed. These structures enable much more efficient crunching of genomic and proteomic data and can serve as the foundation of a truly universal Genomic Positioning System (GenIS). The power of this approach is illustrated by applications to two important problems in biology: phylogenetic analysis, and species identification and classification, both based on a new universal set of biomarkers and methods. We show that certain metrics on these representations can be used to obtain ab initio, from genomic data alone (possibly including full genomes) and in a matter of minutes or hours, the well-established and accepted phylogenies crafted in biology over the last 50 years (such as the 16S rRNA-based phylogenies). We also show how the same representation can be used to solve recognition problems associated with genomic data, in particular the problem of species identification, and to store large genomes in compact representations while preserving the ability to query them efficiently. We also sketch other applications to be explored in the future, including objective criteria for producing biological taxonomies and a truly universal and comprehensive "Atlas of Life", as it is or as it could be on Earth.

    Molecular computing approaches

    No full text
    Molecular approaches exploit structural properties built deep into DNA by millions of years of evolution on Earth to code and/or extract significant features from raw datasets for the purpose of extreme dimensionality reduction and solution efficiency. After the deep structure is described, it is leveraged to render several variations on this theme. They can obviously be used with genomic data but, perhaps surprisingly, work just as well with ordinary abiotic data. Two major families of techniques of this kind are reviewed, namely genomic and pmeric coordinate systems for dimensionality reduction and data analysis.

    Information-theoretic approaches

    No full text
    An entirely different but extremely relevant approach to dimensionality reduction can be taken using a different criterion, namely quantifying the information content of the features involved, within themselves or in relation to others. It turns out that Shannon's definition of information yields surprisingly interesting reductions. This chapter discusses five major variations of this idea, including comparisons using the concept of mutual information previously used in statistics and machine learning.
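    For reference, Shannon entropy of a discrete feature is H(X) = -sum p(x) log2 p(x), and the mutual information between two features can be written as I(X;Y) = H(X) + H(Y) - H(X,Y). The sketch below estimates both from empirical frequencies. It illustrates only these standard definitions, not the five variations the chapter discusses; the function names and toy features are assumptions chosen for illustration.

```python
from collections import Counter
from math import log2


def shannon_entropy(values):
    """Empirical Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    counts = Counter(values)
    return -sum((c / n) * log2(c / n) for c in counts.values())


def mutual_information(xs, ys):
    """Empirical mutual information I(X;Y) = H(X) + H(Y) - H(X,Y), in bits."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    joint = list(zip(xs, ys))
    return shannon_entropy(xs) + shannon_entropy(ys) - shannon_entropy(joint)


# Toy features (hypothetical): b copies a exactly, c is independent of a.
a = [0, 0, 1, 1]
b = [0, 0, 1, 1]
c = [0, 1, 0, 1]
print(shannon_entropy(a))        # entropy of a balanced binary feature
print(mutual_information(a, b))  # redundant feature shares all of a's information
print(mutual_information(a, c))  # independent feature shares none
```

    A feature with high mutual information with one already selected adds little new information, which is one common way such scores are used to discard redundant features.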

    Foretelling the Phenotype of a Genomic Sequence

    No full text
    Estimating phenotypic features (physical and biochemical traits) of a biological organism from its genomic sequence alone and/or environmental conditions has major applications in anthropological paleontology and criminal forensics, for example. To what extent do genomic sequences generally and causally determine phenotypic features of organisms, environmental conditions aside? We present results of two studies. The first is in blackfly (Insecta:Diptera:Simuliidae) larvae of two species (Simulium ignescens and S. tunja) with four phenotypic features, including the area and spot pattern of the cephalic apotome (in the form of a Latin cross on the dorsal side of the head), the postgenal cleft (area under the head on the ventral side) and general body color in larva specimens; the second is in strains of Arabidopsis thaliana. They establish that a substantial component of these phenotypic features (over 75 percent) is at least logically inferable from, if not causally determined by, genomic fragments alone, despite the fact that these phenotypic features are not entirely determined by genetic traits. These results suggest that it is possible to infer the genetic contribution to the determination of specific phenotypic features of a biological organism, without recourse to the causal chain of metabolomic and proteomic events leading to them from genomic sequences.

    Classifying single nucleotide polymorphisms in humans

    No full text
    Single nucleotide polymorphisms (SNPs) are the most common form of genetic variation in the human population and are key to personalized medicine. New tests are presented to distinguish pathogenic/malign SNPs (i.e., those likely to contribute to or cause a disease) from nonpathogenic/benign SNPs, regardless of whether they occur in coding (exon) or noncoding (intron) regions of the human genome. The tests are based on the nearest neighbor (NN) model of Gibbs free energy landscapes of DNA hybridization and on deep structural properties of DNA revealed by an approximating metric (the h-distance) in DNA spaces of oligonucleotides of a common size. The quality assessments show that the newly defined PNPG test can classify a SNP with an accuracy of about 73% for the required parameters. The best-performing machine learning model is a feed-forward neural network with a fivefold cross-validation accuracy of at least 73%. These results may provide valuable tools for the SNP classification problem, where tools are lacking, to assess the likelihood that unclassified SNPs cause disease. These tests highlight the significance of hybridization chemistry in SNPs. They can be applied to further the effectiveness of research in the areas of genomics and metabolomics.
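    For readers unfamiliar with the evaluation protocol, fivefold cross-validation splits the data into five disjoint folds, trains on four, scores on the held-out fold, and averages the five scores. The sketch below shows that procedure with the standard library only. The majority-class classifier is a placeholder assumption standing in for any model (it is not the feed-forward network from the abstract), and all names and the toy data are hypothetical.

```python
import random
from collections import Counter


def k_fold_indices(n, k=5, seed=0):
    """Shuffle indices 0..n-1 and deal them into k disjoint folds."""
    if n < k:
        raise ValueError("need at least k examples")
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_val_accuracy(X, y, fit, predict, k=5, seed=0):
    """Mean held-out accuracy over k folds for a (fit, predict) pair."""
    scores = []
    for held_out in k_fold_indices(len(X), k, seed):
        held = set(held_out)
        train = [i for i in range(len(X)) if i not in held]
        model = fit([X[i] for i in train], [y[i] for i in train])
        correct = sum(predict(model, X[i]) == y[i] for i in held_out)
        scores.append(correct / len(held_out))
    return sum(scores) / k


# Placeholder classifier (assumption): always predicts the majority training label.
def fit_majority(X_train, y_train):
    return Counter(y_train).most_common(1)[0][0]


def predict_majority(model, x):
    return model


# Toy data (hypothetical): 20 examples, 15 benign (0) and 5 pathogenic (1).
X = list(range(20))
y = [0] * 15 + [1] * 5
acc = cross_val_accuracy(X, y, fit_majority, predict_majority)
print(f"fivefold cross-validation accuracy: {acc:.2f}")
```

    Because every example is held out exactly once, the averaged score estimates performance on unseen data rather than on the training set.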

    Profiling Environmental Conditions from DNA

    No full text
    DNA is essential for organisms to carry out basic functions, as it encodes information necessary for metabolomics and proteomics, among others. In particular, it is common nowadays to use DNA for profiling living organisms based on their phenotypic traits. These traits are the outcomes of the genetic makeup constrained by the interaction between living organisms and their surrounding environment over time. For environmental conditions, however, the conventional assumption is that they are too random and ephemeral to be encoded in the DNA of an organism. Here, we demonstrate that, to the contrary, genomic DNA may also encode sufficient information about some environmental features of an organism's habitat for a machine learning model to reveal them, although there seem to be exceptions, i.e. some environmental features do not appear to be coded in DNA, unless our methods miss that information. Nevertheless, we demonstrate that these features can be used to train better models for better predictions of other environmental factors. These results lead directly to the question of whether, over evolutionary history, DNA itself is actually also a repository of information related to the environment where the lineage has developed, perhaps even more cryptically than the way it encodes phenotypic information.